Ingestion Settings

Ingestion settings control how your data is prepared for efficient search and retrieval by the LLM. You can adjust these settings to suit your specific use case and document type.

Scan Document for Images

This feature allows the system to generate descriptions for images found within your documents, making image content discoverable through search.

💡 Note: This feature is enabled by default.

An OCR (Optical Character Recognition) solution is used to extract text from images. This extracted text, along with generated image descriptions, enhances search capabilities by indexing visual content.

Select PDF Parser

💡 Note: This feature is currently available to selected customers who are granted early access. Please contact your sales representative if you wish to also receive early access. Capabilities and pricing for parsers are subject to change.

Airia uses different PDF parsers to extract content from your PDF documents. Selecting the correct parser ensures optimal data extraction and searchability, especially for complex layouts.

Basic: Default option. Optimized for simple documents such as plain-text PDFs and repetitive simple layouts.
Advanced: Optimized for images, math expressions, tables, scanned documents, and complex layouts.
Universal: Handles diverse layouts, handwritten notes, and noisy scans.

Edit the Selected Parser

You can change the PDF parser for your data source. Go to the option menu next to your data source and click Edit. From the edit screen, select a new parser. This new parser will be applied to all newly added or updated files within the data source after sync. To apply the new parser to all existing files, you must create a new data source with the desired parser setting.

Configure Text-to-SQL for Structured Data

Text-to-SQL allows you to interact with your structured data (specifically .csv and .xlsx files) using natural language queries, which are then translated into SQL.

When to Use Text-to-SQL

Use Text-to-SQL when you need to ask precise, quantitative questions about your structured data, such as:

“What is the revenue generated by product A for the year to date?”
“How many leads have we generated for the last year?”

How to Use Text-to-SQL

1. Set Up Your Data Source

Begin by setting up your data source with the relevant .csv or .xlsx files. The data source can also contain other file types.

2. Activate SQL Indexing

In the Ingestion settings for your data source, activate the SQL indexing option. For your .csv/.xlsx files, choose one of the following:

Semantic: When selected, only vectors will be generated for the structured files. This enables text search based on meaning and context. Choose this for semi-structured tabular data where natural language understanding is key.
💡 Example: For a survey documented in an Excel file with open-ended customer answers, use Semantic. Question: “What are the common complaints customers have about Agent Builder?”
SQL Only: When selected, the file will be indexed as SQL only, without enabling semantic search. Choose this for highly structured data where precise, quantitative answers are expected.
💡 Example: Question: “How many complaints are registered as High priority?”
Both: When selected, both vectors and SQL indexes will be generated for the structured files. This can enhance retrieval accuracy, but at the expense of speed and cost because retrieval runs twice.
💡 Note: Both is the default option for the Text-to-SQL setting. For all other file types within the same data source, only vector embeddings (semantic search) will be generated.

3. Check Ingestion Status for Structured Files

For .csv/.xlsx files, you can monitor their ingestion status directly within the data source view. The status indicates the success of both SQL and vector indexing:

Ready: Both the SQL index and vector embeddings have been successfully created.
Failed: Neither the SQL index nor the vector embeddings could be created. You can check the reason for failure in the Failed files logs (indicated by a red button at the top of the page).
Partial: One of the two indexes (either SQL or vector) has failed, while the other was successful. Hover over the “Partial” status to see which specific index is ready and which has failed. The reason for the failed index can also be found in the Failed files logs.
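The three statuses above follow from two independent outcomes, one per index. The sketch below shows that mapping; the function name ingestion_status and its boolean parameters are hypothetical and are not part of any Airia API.

```python
def ingestion_status(sql_ok: bool, vector_ok: bool) -> str:
    """Combine the two index outcomes into one status label.

    Hypothetical helper for illustration only; the names are assumptions.
    """
    if sql_ok and vector_ok:
        return "Ready"      # both indexes were created
    if not sql_ok and not vector_ok:
        return "Failed"     # neither index was created
    return "Partial"        # exactly one index was created


# Print the status for each of the four possible combinations.
for sql_ok in (True, False):
    for vector_ok in (True, False):
        print(sql_ok, vector_ok, ingestion_status(sql_ok, vector_ok))
```

Partial covers two distinct cases (SQL ready with vectors failed, and the reverse), which is why the interface asks you to hover over the status to see which index failed.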

4. Use in the Agent

In your Agent’s workflow, activate the Text-to-SQL retrieval option in the Data Source step. By default, this option is disabled, and the Data Source relies on Semantic retrieval. When Text-to-SQL retrieval is enabled, queries run only against the .csv and .xlsx files in the connected Data Source.

💡 Example: To retrieve all sales records from an Excel file where sales exceed $5,000 and the date falls within Q1 2025, a SQL query like SELECT * FROM sales WHERE amount > 5000 AND date >= '2025-01-01' AND date < '2025-04-01' provides an efficient and precise solution by leveraging the file’s structured format.
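A minimal sketch of how a Q1 filter of this kind behaves, using Python's built-in sqlite3 module with an in-memory table. The table name, column names, and sample rows are assumptions made for illustration, and dates are assumed to be stored as ISO 8601 text (YYYY-MM-DD). The SQL that Airia actually generates may differ.

```python
import sqlite3

# In-memory database standing in for an indexed spreadsheet (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, date TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, 7200.0, "2025-01-15"),  # Q1, above threshold
        (2, 4800.0, "2025-02-03"),  # Q1, below threshold
        (3, 9100.0, "2025-03-28"),  # Q1, above threshold
        (4, 6500.0, "2025-04-02"),  # Q2, above threshold
    ],
)

# A half-open date range covers January through March. ISO 8601 text
# compares correctly as plain strings, so no date parsing is needed.
query = """
    SELECT id, amount, date
    FROM sales
    WHERE amount > 5000
      AND date >= '2025-01-01'
      AND date < '2025-04-01'
    ORDER BY id
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 7200.0, '2025-01-15'), (3, 9100.0, '2025-03-28')]
```

A pattern match such as date LIKE '2025-01%' would return only January rows, so a range comparison is the safer way to express a full quarter.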

💡 Hint: If you want to enable both Semantic and SQL search types (e.g., when your Data Source contains both .csv/.xlsx files and other file types, or if you chose the Both option for your structured files), you can drag and drop the Data Source step twice onto the canvas. Configure one copy to use Semantic retrieval and the other to use SQL retrieval, then connect both to the LLM.

Text-to-SQL Agent Settings

Model Selection

You need to select the LLM that will be used in the agentic workflow for Text-to-SQL. The LLM is fully responsible for SQL query generation. We recommend using “High Quality Capable” models to achieve stable and accurate results. Recommended models (tested):

High Quality (best performance):
- Claude 4 Sonnet
- GPT 4.1
- Claude 3.7 Sonnet
- GPT 4o
Sufficient Quality:
- GPT 4.1 mini
- Claude 3.5 Sonnet
- GPT 4o mini

Fuzzy Search

You can enable Fuzzy search so the system can match records even when the user’s query contains misspellings. Note that Fuzzy search can increase the complexity of query generation.

When the Agent runs with the configured Data Source step, it produces results based on the chosen settings. The Text-to-SQL retrieval flow outputs a structured result from the SQL query that is generated dynamically from the user’s natural language input.

The choice between semantic retrieval and SQL retrieval depends on the query type, data structure, scalability needs, and maintenance considerations. For structured files such as .csv and .xlsx queried with precise, structured questions, SQL retrieval is preferred for its efficiency, accuracy, and ability to answer quantitative questions. For natural language queries, or for text fields that require semantic understanding, semantic retrieval is the better fit. In practice, combining both methods often provides the most flexible and effective solution, especially for agents that interact with users through natural language.
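To make the Fuzzy search setting more concrete, here is a minimal sketch of one way a misspelled term can be mapped to a value that exists in a column, using Python's standard difflib module. This illustrates the general idea only; it is not a description of how Airia implements Fuzzy search, and the helper name closest_value, the cutoff, and the sample values are assumptions.

```python
import difflib
from typing import Optional, Sequence


def closest_value(term: str, known_values: Sequence[str],
                  cutoff: float = 0.6) -> Optional[str]:
    """Return the known value most similar to `term`, or None if nothing
    reaches the similarity cutoff. Matching is case-insensitive.

    Hypothetical helper for illustration only.
    """
    by_lower = {value.lower(): value for value in known_values}
    matches = difflib.get_close_matches(
        term.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[matches[0]] if matches else None


priorities = ["High", "Medium", "Low"]
print(closest_value("Hihg", priorities))  # High
print(closest_value("zzz", priorities))   # None
```

The resolved value could then be used in an exact-match WHERE clause, which keeps the SQL itself simple while still tolerating typos in the question.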

Configure Vector Database

The chosen Vector Database significantly impacts search capabilities, especially regarding hybrid search.

Available Options

Airia DB: This is the default vector database option. The proprietary database supports Hybrid search by default. If Hybrid search is turned off, only dense vectors will be generated and only semantic search will be available for the data source.
Pinecone BYOK (Bring Your Own Key): Hybrid Search is available if the index you provide in your Pinecone database supports it (i.e., it’s configured for both dense and sparse vectors). In that case, Airia will, by default, generate both sparse and dense vectors in your Pinecone database to enable this capability. Requires a Pinecone index name and API key.
Weaviate BYOK (Bring Your Own Key): Hybrid Search is always available with Weaviate. Weaviate applies Fusion algorithms for ranking results from both keyword (lexical) and semantic searches, enhancing relevance. You can learn more about fusion algorithms in the Weaviate blog. Requires a Weaviate endpoint and API key.
Azure AI BYOK (Bring Your Own Key): Hybrid Search is always enabled. Azure AI does not support Fusion algorithms for ranking results. Requires an Azure AI endpoint and API key.
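To illustrate what a fusion step does in hybrid search, the sketch below merges a keyword ranking and a semantic ranking with reciprocal rank fusion (RRF), a widely used technique. It is shown as a generic example and is not necessarily the algorithm that Weaviate or Airia applies. The function name, the constant k=60, and the document IDs are assumptions.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one list.

    Each document scores 1 / (k + rank) for every list it appears in
    (rank starts at 1). Ties are broken alphabetically by ID.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["doc_a", "doc_b", "doc_c"]    # lexical (sparse) results
semantic_hits = ["doc_b", "doc_d", "doc_a"]   # dense-vector results
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that rank well in both lists rise to the top, which is the practical benefit of hybrid search over either method alone.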
